========================================================


Introduction to Dataset

This analysis looks at Arabica coffee, which accounts for about 60% of the world’s coffee production. The dataset contains quality measurements for individual coffee samples.

Dataset Source

The data was gathered from the Coffee Quality Institute (CQI) in January 2018. Data website: https://www.kaggle.com/volpatto/coffee-quality-database-from-cqi

Getting Overall View of the Dataset

The overall goal is to review the dataset, look for commonalities among low- and high-ranking coffee beans, and find correlations that point to the right conditions for delicious coffee!

There are 1311 rows of data and 44 columns.

## [1] 1311   44
## 'data.frame':    1311 obs. of  44 variables:
##  $ X                    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Species              : chr  "Arabica" "Arabica" "Arabica" "Arabica" ...
##  $ Owner                : chr  "metad plc" "metad plc" "grounds for health admin" "yidnekachew dabessa" ...
##  $ Country.of.Origin    : chr  "Ethiopia" "Ethiopia" "Guatemala" "Ethiopia" ...
##  $ Farm.Name            : chr  "metad plc" "metad plc" "san marcos barrancas \"san cristobal cuch" "yidnekachew dabessa coffee plantation" ...
##  $ Lot.Number           : chr  "" "" "" "" ...
##  $ Mill                 : chr  "metad plc" "metad plc" "" "wolensu" ...
##  $ ICO.Number           : chr  "2014/2015" "2014/2015" "" "" ...
##  $ Company              : chr  "metad agricultural developmet plc" "metad agricultural developmet plc" "" "yidnekachew debessa coffee plantation" ...
##  $ Altitude             : chr  "1950-2200" "1950-2200" "1600 - 1800 m" "1800-2200" ...
##  $ Region               : chr  "guji-hambela" "guji-hambela" "" "oromia" ...
##  $ Producer             : chr  "METAD PLC" "METAD PLC" "" "Yidnekachew Dabessa Coffee Plantation" ...
##  $ Number.of.Bags       : int  300 300 5 320 300 100 100 300 300 50 ...
##  $ Bag.Weight           : chr  "60 kg" "60 kg" "1" "60 kg" ...
##  $ In.Country.Partner   : chr  "METAD Agricultural Development plc" "METAD Agricultural Development plc" "Specialty Coffee Association" "METAD Agricultural Development plc" ...
##  $ Harvest.Year         : chr  "2014" "2014" "" "2014" ...
##  $ Grading.Date         : chr  "April 4th, 2015" "April 4th, 2015" "May 31st, 2010" "March 26th, 2015" ...
##  $ Owner.1              : chr  "metad plc" "metad plc" "Grounds for Health Admin" "Yidnekachew Dabessa" ...
##  $ Variety              : chr  "" "Other" "Bourbon" "" ...
##  $ Processing.Method    : chr  "Washed / Wet" "Washed / Wet" "" "Natural / Dry" ...
##  $ Aroma                : num  8.67 8.75 8.42 8.17 8.25 8.58 8.42 8.25 8.67 8.08 ...
##  $ Flavor               : num  8.83 8.67 8.5 8.58 8.5 8.42 8.5 8.33 8.67 8.58 ...
##  $ Aftertaste           : num  8.67 8.5 8.42 8.42 8.25 8.42 8.33 8.5 8.58 8.5 ...
##  $ Acidity              : num  8.75 8.58 8.42 8.42 8.5 8.5 8.5 8.42 8.42 8.5 ...
##  $ Body                 : num  8.5 8.42 8.33 8.5 8.42 8.25 8.25 8.33 8.33 7.67 ...
##  $ Balance              : num  8.42 8.42 8.42 8.25 8.33 8.33 8.25 8.5 8.42 8.42 ...
##  $ Uniformity           : num  10 10 10 10 10 10 10 10 9.33 10 ...
##  $ Clean.Cup            : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ Sweetness            : num  10 10 10 10 10 10 10 9.33 9.33 10 ...
##  $ Cupper.Points        : num  8.75 8.58 9.25 8.67 8.58 8.33 8.5 9 8.67 8.5 ...
##  $ Total.Cup.Points     : num  90.6 89.9 89.8 89 88.8 ...
##  $ Moisture             : num  0.12 0.12 0 0.11 0.12 0.11 0.11 0.03 0.03 0.1 ...
##  $ Category.One.Defects : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Quakers              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Color                : chr  "Green" "Green" "" "Green" ...
##  $ Category.Two.Defects : int  0 1 0 2 2 1 0 0 0 4 ...
##  $ Expiration           : chr  "April 3rd, 2016" "April 3rd, 2016" "May 31st, 2011" "March 25th, 2016" ...
##  $ Certification.Body   : chr  "METAD Agricultural Development plc" "METAD Agricultural Development plc" "Specialty Coffee Association" "METAD Agricultural Development plc" ...
##  $ Certification.Address: chr  "309fcf77415a3661ae83e027f7e5f05dad786e44" "309fcf77415a3661ae83e027f7e5f05dad786e44" "36d0d00a3724338ba7937c52a378d085f2172daa" "309fcf77415a3661ae83e027f7e5f05dad786e44" ...
##  $ Certification.Contact: chr  "19fef5a731de2db57d16da10287413f5f99bc2dd" "19fef5a731de2db57d16da10287413f5f99bc2dd" "0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660" "19fef5a731de2db57d16da10287413f5f99bc2dd" ...
##  $ unit_of_measurement  : chr  "m" "m" "m" "m" ...
##  $ altitude_low_meters  : num  1950 1950 1600 1800 1950 ...
##  $ altitude_high_meters : num  2200 2200 1800 2200 2200 NA NA 1700 1700 1850 ...
##  $ altitude_mean_meters : num  2075 2075 1700 2000 2075 ...
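
The overview above is the kind of output that `dim()` and `str()` produce. Here is a minimal sketch of that step. The CSV file name in the comment is an assumption, and the tiny data frame is only a stand-in so the snippet runs on its own.

```r
# With the real file, loading might look like this (file name is an assumption):
# coffeedata <- read.csv("arabica_data_cleaned.csv", stringsAsFactors = FALSE)

# Tiny stand-in using a few of the real column names and illustrative values.
toy <- data.frame(
  Species = c("Arabica", "Arabica", "Arabica"),
  Country.of.Origin = c("Ethiopia", "Ethiopia", "Guatemala"),
  Total.Cup.Points = c(90.58, 89.92, 89.75),
  stringsAsFactors = FALSE
)

dims <- dim(toy)  # number of rows, then number of columns
print(dims)
str(toy)          # type and first values of every column
```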

Changing Data to Eliminate Certain Columns

Some columns are not needed in my analysis, either because they have too many empty values, because they duplicate other columns, or because they don’t tell me much. I will remove those columns to narrow things down.

Now it’s 24 columns.
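
Dropping columns by name can be done in base R without any extra packages. This is only a sketch: the drop list below is hypothetical, since the write-up mentions Species, Bag.Weight, and Certification.Contact as examples but never gives the full list.

```r
# One-row stand-in with a few of the original column names.
toy <- data.frame(
  Species = "Arabica",
  Owner = "metad plc",
  Bag.Weight = "60 kg",
  Country.of.Origin = "Ethiopia",
  Total.Cup.Points = 90.58,
  stringsAsFactors = FALSE
)

# Hypothetical drop list (not the author's actual list).
drop_cols <- c("Species", "Owner", "Bag.Weight", "Certification.Contact")

# Keep every column whose name is not in the drop list.
# Names in drop_cols that are absent from the data are simply ignored.
trimmed <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(trimmed))
```

Filtering on `names()` rather than column positions keeps the step working even if the column order changes.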


Univariate Plots Section

In this section I will be performing preliminary exploration of the dataset. I will run some summaries of the data, clean the data, and create some plots to understand the structure of my variables.

Cleaning up the Data to Prepare for Plotting

## [1] FALSE

Cleaning the farm

I am interested in the farms, countries, and harvest years, but before I try to plot them, I am curious about the number of unique values.

There are 558 unique farm names, 37 unique countries, 47 unique harvest years, and 557 unique expiration dates.

## [1] 558
## [1] 37
## [1] 47
## [1] 557
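
Counts like the ones above come from combining `unique()` with `length()`. A small sketch on made-up values, applied to several columns at once:

```r
# Made-up stand-in data; only the column names come from the dataset.
toy <- data.frame(
  Farm.Name = c("metad plc", "metad plc", "wolensu", ""),
  Country.of.Origin = c("Ethiopia", "Ethiopia", "Ethiopia", "Guatemala"),
  stringsAsFactors = FALSE
)

# Number of distinct values per column.
# Note that a blank string counts as its own value until it is turned into NA.
n_unique <- vapply(toy, function(col) length(unique(col)), integer(1))
print(n_unique)
```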

It’s time to do some cleaning. There is a lot of inconsistent data in Harvest.Year that won’t be good for plotting. Expiration holds dates as strings, and I’d really like years as numbers. I’d also like NAs rather than blanks so I can filter them out later.

Because Harvest.Year sometimes holds a range of years, like 2016-2017, and does so inconsistently, I’d like to keep only the first year of the harvest. So I’m going to rename the column to Harvest.Year.Begin.

The data is so inconsistent that it’s hard to find an automated rule that would save me time. Since the dataset is fairly small, I decided to change the values manually.

## 
## 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 
##    2   20   30   36  352  199  245  153  129   87    1
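
The values here were changed by hand. For comparison, a regular expression can pull out the first four-digit year automatically. This is an alternative sketch, not the method used in this analysis; the function name `first_year` and the sample strings are made up, and entries without a full year still come back as NA and would need manual review.

```r
# Return the first four-digit year (19xx or 20xx) in each string, else NA.
first_year <- function(x) {
  m <- regexpr("(19|20)[0-9]{2}", x)
  hit <- !is.na(m) & m != -1          # positions where a year was found
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)        # regmatches() returns only the hits
  as.integer(out)
}

# Made-up examples of messy entries.
raw <- c("2014", "2016-2017", "2013/2014", "", "March 2010", "4T/10")
Harvest.Year.Begin <- first_year(raw)
print(Harvest.Year.Begin)
```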

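I’d like Expiration to be a year rather than a month, day, and year in string format. Since this data was more consistent, it was easier to clean. One value (018\n, in row 962) did not convert cleanly; it shows up in the table below and ends up as the single NA in this column.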

## 
## 018\n  2011  2012  2013  2014  2015  2016  2017  2018  2019 
##     1    50    83   314   153   246   195   123   140     6
##     Sample.ID Country.of.Origin          Farm.Name     Mill
## 962       962            Brazil fazendas klem ltda dry mill
##                     Region Harvest.Year.Begin Variety Processing.Method Aroma
## 962 matas de minas                       <NA>  Catuai     Natural / Dry   7.5
##     Flavor Aftertaste Acidity Body Balance Uniformity Clean.Cup Sweetness
## 962   7.58        7.5     7.5 7.75    7.83       9.33      9.33      9.33
##     Total.Cup.Points Moisture Category.One.Defects Color Category.Two.Defects
## 962            81.33     0.11                    0 Green                    0
##     Expiration altitude_mean_meters
## 962      018\n                 1100

Now the year fields are converted to numeric fields.

##  num [1:1311] 2014 2014 NA 2014 2014 ...
##  num [1:1311] 2016 2016 2011 2016 2016 ...
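
A sketch of how string dates such as "April 3rd, 2016" could be reduced to a numeric year, with blanks turned into NA first. The helper names `blank_to_na` and `year_of` are made up for illustration and are not the code used in this analysis.

```r
# Turn blank or whitespace-only strings into NA.
blank_to_na <- function(x) {
  x[!is.na(x) & trimws(x) == ""] <- NA
  x
}

# Pull the four-digit year out of a string date; anything else becomes NA.
year_of <- function(x) {
  m <- regexpr("[0-9]{4}", x)
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  as.numeric(out)
}

# Two well-formed dates, the malformed value seen in row 962, and a blank.
expiration_raw <- c("April 3rd, 2016", "May 31st, 2011", "018\n", "")
Expiration <- year_of(blank_to_na(expiration_raw))
print(Expiration)
```

Because the malformed value has no four-digit run, it becomes NA instead of a wrong year.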

Visualizations and stats preliminary investigation

Information about dataset

The dataset has 1311 observations.

Here is an overall summary of the data.
What stands out is that Harvest Year in this dataset runs from 2008 to 2018. Also, most of the data is filled out; the NAs are limited to the two year variables and altitude.

##    Sample.ID      Country.of.Origin   Farm.Name             Mill          
##  Min.   :   1.0   Length:1311        Length:1311        Length:1311       
##  1st Qu.: 328.5   Class :character   Class :character   Class :character  
##  Median : 656.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 656.0                                                           
##  3rd Qu.: 983.5                                                           
##  Max.   :1312.0                                                           
##                                                                           
##     Region          Harvest.Year.Begin   Variety          Processing.Method 
##  Length:1311        Min.   :2008       Length:1311        Length:1311       
##  Class :character   1st Qu.:2012       Class :character   Class :character  
##  Mode  :character   Median :2013       Mode  :character   Mode  :character  
##                     Mean   :2014                                            
##                     3rd Qu.:2015                                            
##                     Max.   :2018                                            
##                     NA's   :57                                              
##      Aroma           Flavor        Aftertaste       Acidity     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:7.420   1st Qu.:7.330   1st Qu.:7.250   1st Qu.:7.330  
##  Median :7.580   Median :7.580   Median :7.420   Median :7.500  
##  Mean   :7.564   Mean   :7.518   Mean   :7.398   Mean   :7.533  
##  3rd Qu.:7.750   3rd Qu.:7.750   3rd Qu.:7.580   3rd Qu.:7.750  
##  Max.   :8.750   Max.   :8.830   Max.   :8.670   Max.   :8.750  
##                                                                 
##       Body          Balance        Uniformity       Clean.Cup     
##  Min.   :0.000   Min.   :0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:7.330   1st Qu.:7.330   1st Qu.:10.000   1st Qu.:10.000  
##  Median :7.500   Median :7.500   Median :10.000   Median :10.000  
##  Mean   :7.518   Mean   :7.518   Mean   : 9.833   Mean   : 9.833  
##  3rd Qu.:7.670   3rd Qu.:7.750   3rd Qu.:10.000   3rd Qu.:10.000  
##  Max.   :8.580   Max.   :8.750   Max.   :10.000   Max.   :10.000  
##                                                                   
##    Sweetness      Total.Cup.Points    Moisture       Category.One.Defects
##  Min.   : 0.000   Min.   : 0.00    Min.   :0.00000   Min.   : 0.0000     
##  1st Qu.:10.000   1st Qu.:81.17    1st Qu.:0.09000   1st Qu.: 0.0000     
##  Median :10.000   Median :82.50    Median :0.11000   Median : 0.0000     
##  Mean   : 9.903   Mean   :82.12    Mean   :0.08886   Mean   : 0.4264     
##  3rd Qu.:10.000   3rd Qu.:83.67    3rd Qu.:0.12000   3rd Qu.: 0.0000     
##  Max.   :10.000   Max.   :90.58    Max.   :0.28000   Max.   :31.0000     
##                                                                          
##     Color           Category.Two.Defects   Expiration   altitude_mean_meters
##  Length:1311        Min.   : 0.000       Min.   :2011   Min.   :     1      
##  Class :character   1st Qu.: 0.000       1st Qu.:2013   1st Qu.:  1100      
##  Mode  :character   Median : 2.000       Median :2015   Median :  1311      
##                     Mean   : 3.592       Mean   :2015   Mean   :  1784      
##                     3rd Qu.: 4.000       3rd Qu.:2016   3rd Qu.:  1600      
##                     Max.   :55.000       Max.   :2019   Max.   :190164      
##                                          NA's   :1      NA's   :227

Let’s look at where the coffee comes from.

Mexico contributes more samples than any other country in this dataset: 236 of the 1311 samples, or about 18%.

## 
##                       Brazil                      Burundi 
##                          132                            2 
##                        China                     Colombia 
##                           16                          183 
##                   Costa Rica                Cote d?Ivoire 
##                           51                            1 
##                      Ecuador                  El Salvador 
##                            1                           21 
##                     Ethiopia                    Guatemala 
##                           44                          181 
##                        Haiti                     Honduras 
##                            6                           53 
##                        India                    Indonesia 
##                            1                           20 
##                        Japan                        Kenya 
##                            1                           25 
##                         Laos                       Malawi 
##                            3                           11 
##                    Mauritius                       Mexico 
##                            1                          236 
##                      Myanmar                    Nicaragua 
##                            8                           26 
##                       Panama             Papua New Guinea 
##                            4                            1 
##                         Peru                  Philippines 
##                           10                            5 
##                       Rwanda                       Taiwan 
##                            1                           75 
## Tanzania, United Republic Of                     Thailand 
##                           40                           32 
##                       Uganda                United States 
##                           26                            8 
##       United States (Hawaii)  United States (Puerto Rico) 
##                           73                            4 
##                      Vietnam                       Zambia 
##                            7                            1
## [1] "18%"
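
The country counts and the percentage above can be reproduced with `table()`. A short sketch on a made-up vector, plus the arithmetic for the real figures:

```r
# Made-up stand-in for the Country.of.Origin column.
country <- c("Mexico", "Mexico", "Mexico", "Colombia", "Colombia", "Guatemala")

counts <- sort(table(country), decreasing = TRUE)  # most frequent first
top_country <- names(counts)[1]
top_share <- paste0(round(100 * counts[[1]] / sum(counts)), "%")
print(c(top_country, top_share))

# The same arithmetic with the figures from this dataset: 236 of 1311 samples.
reported_share <- paste0(round(100 * 236 / 1311), "%")
print(reported_share)
```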

Looking at Harvest Year from 2008 - 2018:

The largest number of samples comes from the 2012 harvest (352). The outliers are 2008 and 2018, with only 2 and 1 samples. I’m assuming that is because data collection started and ended partway through those years.

## 
## 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 
##    2   20   30   36  352  199  245  153  129   87    1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2008    2012    2013    2014    2015    2018      57

Looking at Expiration Dates - 2013 is the most common expiration year:

After faceting on harvest year, and knowing that the largest share of samples was harvested in 2012, it appears that coffee typically expires about one year after harvest, which would explain why 2013 is the most frequent expiration year.

The summary below seems to support that assumption of expiration dates.

## coffeedata$Harvest.Year.Begin: 2008
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2011    2011    2011    2011    2011    2011 
## ------------------------------------------------------------ 
## coffeedata$Harvest.Year.Begin: 2009
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2011    2011    2011    2011    2011    2012 
## ------------------------------------------------------------ 
## coffeedata$Harvest.Year.Begin: 2010
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2011    2011    2012    2012    2012    2012 
## ------------------------------------------------------------ 
## coffeedata$Harvest.Year.Begin: 2011
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2012    2012    2012    2012    2012    2013 
## ------------------------------------------------------------ 
## coffeedata$Harvest.Year.Begin: 2012
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2013    2013    2013    2013    2013    2015 
## ------------------------------------------------------------ 
## coffeedata$Harvest.Year.Begin: 2013
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2014    2014    2014    2014    2015    2015 
## ------------------------------------------------------------ 
## coffeedata$Harvest.Year.Begin: 2014
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2015    2015    2015    2015    2016    2017 
## ------------------------------------------------------------ 
## coffeedata$Harvest.Year.Begin: 2015
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2016    2016    2016    2016    2017    2018 
## ------------------------------------------------------------ 
## coffeedata$Harvest.Year.Begin: 2016
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2016    2017    2017    2017    2018    2018 
## ------------------------------------------------------------ 
## coffeedata$Harvest.Year.Begin: 2017
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2018    2018    2018    2018    2018    2019 
## ------------------------------------------------------------ 
## coffeedata$Harvest.Year.Begin: 2018
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2019    2019    2019    2019    2019    2019
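
The layout of the grouped summary above looks like what `by()` prints. Here is a sketch on made-up data that also measures the gap between harvest and expiration directly; `lag_years` and `median_lag` are illustrative names.

```r
# Made-up stand-in; the column names come from the dataset.
toy <- data.frame(
  Harvest.Year.Begin = c(2012, 2012, 2012, 2013, 2013, NA),
  Expiration = c(2013, 2013, 2015, 2014, 2015, 2011)
)

# Summary of Expiration within each harvest year (NA groups are dropped).
print(by(toy$Expiration, toy$Harvest.Year.Begin, summary))

# Gap in years between harvest and expiration, ignoring missing values.
lag_years <- toy$Expiration - toy$Harvest.Year.Begin
median_lag <- median(lag_years, na.rm = TRUE)
print(median_lag)
```

Working with the gap itself is a quick way to check the "expires about one year later" assumption across all harvest years at once.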

Looking at altitude averages.

There was a very large outlier in altitude (a maximum of 190164 meters), which makes me think it was a data-entry mistake. Noting that, I scaled down my plot. Given the outlier, the median is a more reliable measure of typical altitude than the mean.

The median altitude at which the coffee was grown was 1311 meters. That’s an elevation of about 4,301 feet.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1    1100    1311    1784    1600  190164     227
## [1] 4301.181
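
A sketch of why the median holds up against the outlier, and of the meters-to-feet conversion. The short vector is made up (it borrows the quartiles and extremes from the summary above), and 3.28084 is the usual feet-per-meter factor, which reproduces the 4301.181 shown.

```r
# Made-up vector with one impossible value and one missing value.
altitude_mean_meters <- c(1, 1100, 1311, 1600, 190164, NA)

mean_m <- mean(altitude_mean_meters, na.rm = TRUE)      # dragged up by the outlier
median_m <- median(altitude_mean_meters, na.rm = TRUE)  # unaffected by it

feet_per_meter <- 3.28084
median_ft <- median_m * feet_per_meter

print(c(mean = mean_m, median = median_m, median_ft = median_ft))
```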

Beans are inspected for defects. Category one defects would be things like rotten beans. Category two defects are things like broken beans.

I noticed there were a lot more category two defects. This makes sense because category one defects should be zero and category two defects should be less than 5.

I created another column called Quality.Standards. This new variable is TRUE if the beans have zero category one defects and fewer than 5 category two defects.

## 
## FALSE  TRUE 
##   418   893
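
The rule for the new column can be written as a single logical expression. A sketch on made-up defect counts:

```r
# Made-up defect counts; the column names come from the dataset.
toy <- data.frame(
  Category.One.Defects = c(0, 0, 1, 0),
  Category.Two.Defects = c(0, 4, 2, 5)
)

# TRUE only when there are no category one defects
# and fewer than 5 category two defects.
toy$Quality.Standards <-
  toy$Category.One.Defects == 0 & toy$Category.Two.Defects < 5

print(table(toy$Quality.Standards))
```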

Here is a plot that visually shows the result of the new column.


Univariate Analysis

What is the structure of your dataset?

There were 1311 rows of data and 44 columns initially. However, many of those columns were eliminated.

dim(coffeedata)
## [1] 1311   25

Quality Measure Meanings

Scores run from 10 (best) to 1 (worst) on the scale used by the SCAA (Specialty Coffee Association of America). Cupping is a process that involves roasting the coffee and simply brewing it by adding hot water to the ground beans.

* Aroma:
  + Aromatic aspects when infused with water
* Flavor:
  + Taste and aroma, mid tones of coffee, based on flavor wheel
* Aftertaste:
  + Duration of positive flavor attributes of coffee
* Acidity:
  + Brightness (higher number) or sourness of coffee
* Body:
  + Heaviness perceived on the tongue
* Balance:
  + Overall rating of coffee
* Uniformity:
  + Consistency of taste
* Cup Cleanliness:
  + Transparency in the cup, should be free of off-flavors and defects
* Sweetness:
  + Subtle pleasant sweetness in coffee
* Moisture:
  + Should have a moisture content of 8 to 12.5%. Less or more will be rated low.
* Defects:
  + Info: Primary (e.g. black beans, sour beans) or Secondary (e.g. broken beans)
  + Primary: should have zero
  + Secondary: should have less than five
* SCAA Total Cup Points (100 point scale):
  + 90-100 - Outstanding
  + 85-89.99 - Excellent
  + 80-84.99 - Very Good
  + <80 - No scoring

Farm and Bean Data

* Country of Origin
* Farm Name
* Lot Number
* Mill
* Altitude
* Region
* Processing method
* Variety:
  + Type of arabica coffee
* Color:
  + Grayish-blue, blue-green is most desirable, gradually dries in sun
  + Green is middle of the road
  + Green-brown scorched during drying or picked while under or over-ripe

What is/are the main feature(s) of interest in your dataset?

This dataset contains quality measurements that I have yet to explore. These ratings follow the Specialty Coffee Association of America scale and are category-by-category tests to separate delicious coffee from not-so-delicious coffee.

I am curious about delicious coffee and how it correlates to other variables such as altitude and defects. I would like to have the formula down for excellent coffee! I love coffee!

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think the variety and processing of the individual beans may also be important indicators of quality. I also think the origins of the coffee might be interesting to investigate.

Did you create any new variables from existing variables in the dataset?

I created a new variable called Quality.Standards. If the beans have no category one defects and less than 5 category two defects, then they are considered quality beans. When they meet quality standards, it will be marked as “TRUE”.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I noticed that mean altitude had some pretty strange outliers; one value was 190164 meters, which is impossible. I knew altitude wasn’t going to be an ongoing factor in the rest of my analysis, so I chose not to remove it at this time.

I removed some columns from my dataset, either because they had no relevance (such as Bag.Weight or Certification.Contact), held the same value throughout (Species), or had nearly no data.

I had to do some data cleaning, especially with Harvest.Year. That field had all sorts of date formats, and some entries had no year at all. I also renamed it to Harvest.Year.Begin because some entries were ranges (I’m assuming for harvests that start at the end of one year and carry into the next). I also changed Expiration to a year only, rather than month, day, and year.


Bivariate Plots Section

Evaluating Origins of Beans vs. Bean Processing with Good Coffee!

Here are the items I will be looking at:

Origins of beans and how they affect quality. In other words, how do harvest year, country, and quality standards (bean defects) bear on the overall quality of the coffee per SCAA measurements?

Bean processing and how it affects quality. In other words, how do processing method, variety of bean, and the moisture of the bean bear on the overall quality of the coffee per SCAA measurements?

Exploring Origins of Beans

Now let’s take a look at the coffee origin information (Country.of.Origin, Harvest.Year.Begin, Quality.Standards) and how it affects good coffee, which will be measured by Total Cup Points.

The overall quality of the coffee test is Total Cup Points (Total.Cup.Points) and is as follows:

* SCAA Total Cup Points (100 point scale):
  + 90-100 - Outstanding
  + 85-89.99 - Excellent
  + 80-84.99 - Very Good
  + <80 - Below Grade

To help me in my analysis I’m going to create another column called Total.Cup.Result with the values Outstanding, Excellent, Very Good, and Below Grade. I think it will add quick clarity and understanding to the visualizations.

##     Total.Cup.Points Total.Cup.Result
## 2              89.92        Excellent
## 3              89.75        Excellent
## 4              89.00        Excellent
## 5              88.83        Excellent
## 6              88.83        Excellent
## 7              88.75        Excellent
## 8              88.67        Excellent
## 9              88.42        Excellent
## 10             88.25        Excellent
## 11             88.08        Excellent
## 12             87.92        Excellent
## 13             87.92        Excellent
## 14             87.92        Excellent
## 15             87.83        Excellent
## 16             87.58        Excellent
## 17             87.42        Excellent
## 18             87.33        Excellent
## 19             87.25        Excellent
## 20             87.25        Excellent
## 21             87.25        Excellent
## 22             87.17        Excellent
## 23             87.17        Excellent
## 24             87.08        Excellent
## 25             87.08        Excellent
## 26             86.92        Excellent
## 27             86.92        Excellent
## 28             86.83        Excellent
## 29             86.67        Excellent
## 30             86.58        Excellent
## 31             86.58        Excellent
## 32             86.50        Excellent
## 33             86.42        Excellent
## 34             86.33        Excellent
## 35             86.25        Excellent
## 36             86.25        Excellent
## 37             86.25        Excellent
## 38             86.25        Excellent
## 39             86.25        Excellent
## 40             86.17        Excellent
## 41             86.17        Excellent
## 42             86.17        Excellent
## 43             86.17        Excellent
## 44             86.08        Excellent
## 45             86.08        Excellent
## 46             86.08        Excellent
## 47             86.00        Excellent
## 48             86.00        Excellent
## 49             86.00        Excellent
## 50             86.00        Excellent
## 51             86.00        Excellent
## 52             86.00        Excellent
## 53             85.92        Excellent
## 54             85.92        Excellent
## 55             85.92        Excellent
## 56             85.83        Excellent
## 57             85.83        Excellent
## 58             85.83        Excellent
## 59             85.83        Excellent
## 60             85.75        Excellent
## 61             85.75        Excellent
## 62             85.75        Excellent
## 63             85.58        Excellent
## 64             85.58        Excellent
## 65             85.58        Excellent
## 66             85.50        Excellent
## 67             85.50        Excellent
## 68             85.50        Excellent
## 69             85.50        Excellent
## 70             85.50        Excellent
## 71             85.42        Excellent
## 72             85.42        Excellent
## 73             85.42        Excellent
## 74             85.42        Excellent
## 75             85.42        Excellent
## 76             85.33        Excellent
## 77             85.33        Excellent
## 78             85.33        Excellent
## 79             85.33        Excellent
## 80             85.33        Excellent
## 81             85.33        Excellent
## 82             85.33        Excellent
## 83             85.33        Excellent
## 84             85.25        Excellent
## 85             85.25        Excellent
## 86             85.25        Excellent
## 87             85.17        Excellent
## 88             85.17        Excellent
## 89             85.08        Excellent
## 90             85.08        Excellent
## 91             85.08        Excellent
## 92             85.08        Excellent
## 93             85.08        Excellent
## 94             85.08        Excellent
## 95             85.08        Excellent
## 96             85.08        Excellent
## 97             85.00        Excellent
## 98             85.00        Excellent
## 99             85.00        Excellent
## 100            85.00        Excellent
## 101            85.00        Excellent
## 102            85.00        Excellent
## 103            85.00        Excellent
## 104            85.00        Excellent
## 105            85.00        Excellent
## 106            85.00        Excellent
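
Labels like the ones above can be produced with `cut()`. This is a sketch rather than the code behind the table; the function name `label_cup_points` is made up, and it assumes each band includes its lower bound, so a score between 89.99 and 90 would count as Excellent.

```r
# Map Total Cup Points to the SCAA bands.
# right = FALSE makes each interval closed on the left: [80, 85), [85, 90), ...
label_cup_points <- function(points) {
  cut(
    points,
    breaks = c(-Inf, 80, 85, 90, Inf),
    labels = c("Below Grade", "Very Good", "Excellent", "Outstanding"),
    right = FALSE
  )
}

# A few scores around the band edges.
scores <- c(90.58, 89.92, 85.00, 84.99, 80.00, 79.90)
Total.Cup.Result <- as.character(label_cup_points(scores))
print(data.frame(Total.Cup.Points = scores, Total.Cup.Result))
```

`cut()` returns a factor whose levels follow the order of the breaks, which is handy for ordering plot legends from worst to best.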

Correlation between Defects of Beans and Total Cup Points

Result: The correlations are rather weak, but both are negative, so more defects go with lower scores. Category two defects show a slightly stronger correlation with total cup points (-0.21) than category one defects (-0.11).

##                      Category.One.Defects Category.Two.Defects Total.Cup.Points
## Category.One.Defects            1.0000000            0.3422092       -0.1068260
## Category.Two.Defects            0.3422092            1.0000000       -0.2136031
## Total.Cup.Points               -0.1068260           -0.2136031        1.0000000
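
A correlation matrix like the one above comes from calling `cor()` on the numeric columns of interest. A sketch on made-up values:

```r
# Made-up values; the column names come from the dataset.
toy <- data.frame(
  Category.One.Defects = c(0, 0, 1, 0, 3, 0),
  Category.Two.Defects = c(0, 2, 4, 1, 9, 3),
  Total.Cup.Points = c(86.2, 84.0, 82.5, 85.1, 78.3, 83.0)
)

# Pairwise Pearson correlations; rows with missing values are dropped.
cor_matrix <- cor(toy, use = "complete.obs")
print(round(cor_matrix, 3))
```

The matrix is symmetric with ones on the diagonal, so only the values below (or above) the diagonal need to be read.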

Country of Origin

The Total Cup Points by Country plot below gives a quick visual as to which countries have not only the most coffee samples, but also which have higher total cup points.

There are a whole lot of outliers in the total cup points. While outliers can distort statistical analysis, in this case I feel they can be very informative, since coffee tasting results are so subjective. So, instead of removing the outliers, I chose to limit my x-axis and y-axis to better show what’s going on, and I will compare median vs. mean to decide which is the better measure of “middle of the road” going forward.

Looking at the data below, it’s easier to see that Mexico, Colombia, and Guatemala contribute the most samples.

## # A tibble: 37 x 5
##    Country.of.Origin         total_cup_mean total_cup__medi… total_cup_max     n
##    <chr>                              <dbl>            <dbl>         <dbl> <int>
##  1 Mexico                              80.9             81.6          87.2   236
##  2 Colombia                            83.1             83.2          86     183
##  3 Guatemala                           81.8             82.5          89.8   181
##  4 Brazil                              82.4             82.4          88.8   132
##  5 Taiwan                              82.0             82            86.6    75
##  6 United States (Hawaii)              81.8             82.8          87.9    73
##  7 Honduras                            79.4             81.7          86.7    53
##  8 Costa Rica                          82.8             83.2          87.2    51
##  9 Ethiopia                            85.5             85.2          90.6    44
## 10 Tanzania, United Republi…           82.4             82.2          86.5    40
## # … with 27 more rows

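The information below shows the countries with the highest median Total Cup Points. Several of them (Papua New Guinea, Japan, Ecuador) have only a single sample, so their medians say little about the country as a whole.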

## # A tibble: 37 x 5
##    Country.of.Origin total_cup_mean total_cup__median total_cup_max     n
##    <chr>                      <dbl>             <dbl>         <dbl> <int>
##  1 United States               86.0              87.2          87.9     8
##  2 Papua New Guinea            85.8              85.8          85.8     1
##  3 Ethiopia                    85.5              85.2          90.6    44
##  4 Japan                       84.7              84.7          84.7     1
##  5 Kenya                       84.3              84.6          86.2    25
##  6 Panama                      83.7              84.1          85.8     4
##  7 Uganda                      84.1              83.9          86.8    26
##  8 Ecuador                     83.8              83.8          83.8     1
##  9 Colombia                    83.1              83.2          86     183
## 10 Costa Rica                  82.8              83.2          87.2    51
## # … with 27 more rows
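
The tibble output above suggests the summaries were built with dplyr. The same per-country figures can be sketched in base R, which avoids the package dependency. The helper `summarise_by` and the toy values are made up for illustration.

```r
# Mean, median, max and count of a value column within each group.
summarise_by <- function(df, group_col, value_col = "Total.Cup.Points") {
  groups <- split(df[[value_col]], df[[group_col]])
  out <- data.frame(
    group = names(groups),
    total_cup_mean = vapply(groups, mean, numeric(1)),
    total_cup_median = vapply(groups, median, numeric(1)),
    total_cup_max = vapply(groups, max, numeric(1)),
    n = vapply(groups, length, integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
  out[order(-out$n), ]  # largest groups first
}

toy <- data.frame(
  Country.of.Origin = c("Mexico", "Mexico", "Mexico", "Ethiopia", "Ethiopia", "Japan"),
  Total.Cup.Points = c(80, 82, 84, 85, 90, 84.7),
  stringsAsFactors = FALSE
)

by_country <- summarise_by(toy, "Country.of.Origin")
by_median <- by_country[order(-by_country$total_cup_median), ]
print(by_country)
print(by_median)
```

Keeping `n` next to the median makes it easy to spot groups whose ranking rests on one or two samples.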

Harvest Year

We already know that 2012 had the most samples of any harvest year (between 2008 and 2018).

Which years appear to have the best total cup points? We can see from below that the best total cup points came in 2014. 2012 definitely contributed the most coffee, but most of it scored below the mean line. So it appears 2014 was not only a good year in terms of volume, but it also produced some good quality coffee!

Exploring Bean Type and Processing

Now let’s take a look at the bean processing information: Variety, Moisture, Processing.Method

Let’s look at how the moisture in the beans affects coffee taste. Moisture content should be between .08 (8%) and .12 (12%). Less or more is considered too dry or too wet by coffee standards.

When I first ran this, there was a low outlier, so I zoomed in on Total Cup Points from 60 - 80 to get a better look. I also added a trend line to see whether there was a correlation between increased moisture and good tasting coffee.

From the moisture visualization, it appears there is very little correlation between moisture and total cup points. The correlation test below agrees: the correlation is weak and negative (about -0.13), although with 1311 samples it is statistically distinguishable from zero.

## 
##  Pearson's product-moment correlation
## 
## data:  coffeedata$Total.Cup.Points and coffeedata$Moisture
## t = -4.5638, df = 1309, p-value = 5.495e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17808326 -0.07149403
## sample estimates:
##        cor 
## -0.1251497
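
To put the size of that coefficient in perspective, squaring it gives the share of variance explained by a straight-line fit: (-0.1251497)^2 is roughly 0.016, or under 2%. The sketch below shows how the same kind of test can be run and summarized in base R. The `toy` data frame and its values are invented for illustration and are not taken from the dataset; only the column names follow the output above.

```r
# Toy stand-in for the coffee data; these values are invented for illustration.
toy <- data.frame(
  Total.Cup.Points = c(84.5, 83.2, 82.9, 82.0, 81.4, 80.7, 83.8, 82.3),
  Moisture         = c(0.10, 0.11, 0.12, 0.12, 0.13, 0.14, 0.09, 0.11)
)

# Pearson correlation test, the same kind of test as the output above.
result <- cor.test(toy$Total.Cup.Points, toy$Moisture, method = "pearson")

r <- unname(result$estimate)  # correlation coefficient
p <- result$p.value           # p-value for H0: true correlation is 0

# r^2 is the share of variance explained by a linear relationship.
cat(sprintf("r = %.3f, r^2 = %.3f, p = %.4f\n", r, r^2, p))
```

A small p-value only says the correlation is unlikely to be exactly zero. With a large sample, even a very weak relationship can be significant, so the coefficient itself is the better guide to how much moisture matters.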

By changing the visualization to a scatter plot and changing the alpha level, I could produce a sort of pseudo-heat map. This gives me a better idea of how moisture affects good coffee.

According to the information above and below, the average moisture is about .09 (the median is .11), and the middle half of the data falls between .09 and .12. It appears that the best coffee exists between .10 and .12. Moisture seems to make an impact: it can’t be too much, and it can’t be too little.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.09000 0.11000 0.08886 0.12000 0.28000
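
One way to check the "not too wet, not too dry" idea numerically is to label each sample by moisture band and compare the average score per band. This is a minimal sketch in base R, assuming the .08 to .12 range mentioned earlier; the `toy` data and the `Moisture.Band` column are hypothetical and not part of the original analysis.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  Total.Cup.Points = c(79.0, 80.5, 82.6, 83.4, 83.0, 82.8, 81.0, 78.5),
  Moisture         = c(0.00, 0.05, 0.09, 0.10, 0.11, 0.12, 0.13, 0.20)
)

# Label each sample; both ends of the .08 to .12 range count as "in range".
toy$Moisture.Band <- ifelse(
  toy$Moisture < 0.08, "too dry",
  ifelse(toy$Moisture > 0.12, "too wet", "in range")
)

# Mean total cup points per moisture band.
band_means <- aggregate(Total.Cup.Points ~ Moisture.Band, data = toy, FUN = mean)
print(band_means)
```

Note that the summary above has a minimum of 0, so some samples probably have a missing moisture reading recorded as zero. Those would land in the "too dry" band unless they are filtered out first.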

Next, it’s time to look at the variety of coffee. I grouped the data together by Variety and then summarized the mean and median.

## # A tibble: 6 x 4
##   Variety       total_cup_mean total_cup_median     n
##   <chr>                  <dbl>            <dbl> <int>
## 1 Arusha                  82.2             82.4     5
## 2 Blue Mountain           82.1             82.1     2
## 3 Bourbon                 81.9             82.3   226
## 4 Catimor                 83.3             83.2    20
## 5 Catuai                  81.3             81.9    74
## 6 Caturra                 82.4             83.1   256
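
The grouping step can be sketched in base R as follows. The output above looks like it came from dplyr, so this is an equivalent illustration rather than the code that was actually run. The `toy` data frame and the `min_n` threshold are assumptions added here; the threshold is a suggestion for keeping varieties with only a handful of samples (Blue Mountain has just 2 above) from topping the ranking by chance.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  Variety          = c("Bourbon", "Bourbon", "Caturra", "Caturra", "Caturra", "Catimor"),
  Total.Cup.Points = c(82.0, 81.8, 82.5, 83.1, 81.6, 83.3)
)

# Per-variety mean, median, and sample count.
means   <- tapply(toy$Total.Cup.Points, toy$Variety, mean)
medians <- tapply(toy$Total.Cup.Points, toy$Variety, median)
counts  <- tapply(toy$Total.Cup.Points, toy$Variety, length)

by_variety <- data.frame(
  Variety          = names(means),
  total_cup_mean   = as.numeric(means),
  total_cup_median = as.numeric(medians),
  n                = as.integer(counts)
)

# Rank by mean, highest first.
by_variety <- by_variety[order(-by_variety$total_cup_mean), ]

# Hypothetical threshold: only rank varieties with at least min_n samples.
min_n <- 2
well_sampled <- by_variety[by_variety$n >= min_n, ]
print(well_sampled)
```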

I’m pretty pleased with what I discovered about Variety. Just from these visualizations, I can now tell which are the three top tasting varieties of coffee beans:

  1. Ethiopian Heirlooms - Ethiopia
  2. Sumatra Lintong - Indonesia
  3. SL34 - Mostly found in Kenya

The visualization supports my earlier finding that Ethiopia produces some of the best tasting coffee (it ranked third among countries). Kenya, where SL34 is mostly found, is just south of Ethiopia.

Processing method is the next item to look at.

## # A tibble: 6 x 5
##   Processing.Method         total_cup_mean total_cup_median total_cup_max     n
##   <chr>                              <dbl>            <dbl>         <dbl> <int>
## 1 Natural / Dry                       82.4             82.8          89     251
## 2 Other                               81.3             81.8          84.7    26
## 3 Pulped natural / honey              82.8             82.7          86.6    14
## 4 Semi-washed / Semi-pulped           82.6             82.5          86.1    56
## 5 Washed / Wet                        82.0             82.4          90.6   812
## 6 <NA>                                82.4             83.1          89.8   152

Inspired by another coffee lover, I also included this data so you can get an idea of what countries use what methods!

##                               
##                                Natural / Dry Other Pulped natural / honey
##   Brazil                                  80     1                      7
##   Burundi                                  0     0                      0
##   China                                    3     0                      1
##   Colombia                                27     0                      0
##   Costa Rica                               0     1                      2
##   Cote d'Ivoire                            0     0                      0
##   Ecuador                                  1     0                      0
##   El Salvador                              1     0                      0
##   Ethiopia                                17     0                      0
##   Guatemala                               10     2                      0
##   Haiti                                    1     0                      0
##   Honduras                                14     0                      0
##   India                                    1     0                      0
##   Indonesia                                2     4                      0
##   Japan                                    0     0                      1
##   Kenya                                    2     0                      0
##   Laos                                     0     0                      0
##   Malawi                                   0     0                      0
##   Mauritius                                0     0                      0
##   Mexico                                  17     0                      0
##   Myanmar                                  2     1                      0
##   Nicaragua                                4     3                      0
##   Panama                                   1     1                      0
##   Papua New Guinea                         0     0                      0
##   Peru                                     0     0                      0
##   Philippines                              1     0                      0
##   Rwanda                                   0     0                      0
##   Taiwan                                  13     9                      2
##   Tanzania, United Republic Of             1     0                      0
##   Thailand                                 2     0                      1
##   Uganda                                   7     0                      0
##   United States                            1     1                      0
##   United States (Hawaii)                  40     0                      0
##   United States (Puerto Rico)              0     0                      0
##   Vietnam                                  3     3                      0
##   Zambia                                   0     0                      0
##                               
##                                Semi-washed / Semi-pulped Washed / Wet
##   Brazil                                              24            6
##   Burundi                                              0            1
##   China                                                0           12
##   Colombia                                             0          121
##   Costa Rica                                           1           45
##   Cote d'Ivoire                                        0            1
##   Ecuador                                              0            0
##   El Salvador                                          1           15
##   Ethiopia                                             0            8
##   Guatemala                                            0          161
##   Haiti                                                0            4
##   Honduras                                             0           35
##   India                                                0            0
##   Indonesia                                            5            6
##   Japan                                                0            0
##   Kenya                                                0           22
##   Laos                                                 0            3
##   Malawi                                               0           11
##   Mauritius                                            0            0
##   Mexico                                              14          198
##   Myanmar                                              0            5
##   Nicaragua                                            0           11
##   Panama                                               0            2
##   Papua New Guinea                                     0            1
##   Peru                                                 0            8
##   Philippines                                          0            4
##   Rwanda                                               0            1
##   Taiwan                                               9           37
##   Tanzania, United Republic Of                         1           37
##   Thailand                                             0           18
##   Uganda                                               1           18
##   United States                                        0            6
##   United States (Hawaii)                               0            9
##   United States (Puerto Rico)                          0            4
##   Vietnam                                              0            1
##   Zambia                                               0            1
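
A cross-tabulation like the one above can be built with base R's `table()`, and `prop.table()` turns the counts into per-country shares, which are easier to compare when countries differ a lot in sample size. The sketch below uses invented `toy` data; only the column names follow the dataset.

```r
# Toy data, invented for illustration only; one method is missing (NA).
toy <- data.frame(
  Country.of.Origin = c("Brazil", "Brazil", "Brazil", "Kenya", "Kenya",
                        "Ethiopia", "Ethiopia", "Ethiopia"),
  Processing.Method = c("Natural / Dry", "Natural / Dry", "Washed / Wet",
                        "Washed / Wet", "Washed / Wet",
                        "Natural / Dry", "Washed / Wet", NA)
)

# Count samples per country and method; rows with NA are dropped by default.
counts <- table(toy$Country.of.Origin, toy$Processing.Method)

# Convert each row to shares so that every country sums to 1.
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

Because `table()` drops missing values by default, samples with no reported processing method do not appear in the counts. Passing `useNA = "ifany"` would add a column for them.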

On average, the Processing Method that produces the highest total cup points is the “Pulped natural/honey” method (mean of 82.8), although that is based on only 14 samples.

In case you are wondering:

The pulped natural / honey process begins drying directly after de-pulping rather than undergoing fermentation to remove the mucilage. “Pulped natural” tends to have more fruit and fermented flavors because the bean has more time to interact with the natural sugars from the cherry as enzymes break down the mucilage around the bean. However, if producers aren’t careful about stirring and watching, funky flavors will emerge in the roasted coffee.

Washed / Wet coffees, by contrast, are known for their vibrant notes. Removing all of the cherry prior to drying allows the intrinsic flavors of the bean to shine without anything holding them back. Fruit notes are still found in washed coffees, but fermented notes and berry notes are less common.

The Natural / Dry method involves drying coffee cherries on either patios or raised beds in the sun. This process only works in areas that are hot and dry, and it tends to give the coffee a more fruity flavor.

I guess it’s a matter of taste!


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I noticed that the U.S. coffee was rated the best tasting, followed by Papua New Guinea. Ethiopia was third and the best variety of coffee also was from Ethiopia (Kenya was just below it).

I expected to see more of a correlation, or at least an upward trend, between moisture and total cup points, where more moisture meant higher cup points. Instead, the scatter plot showed that the best coffee falls within a range of moisture values. It apparently can’t be too wet or too dry.

For both variety and processing method, it was easier to see information when grouping the data and then looking at average, median and max. On average, the processing method that produces the highest total cup points is the “Pulped natural/honey” method. However, the max total cup points came from the “Washed/wet” method.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Looking at total cup points by harvest year, there was an interesting relationship with quality standards, which is the bean inspection. True indicates the coffee passed quality standards for bean inspection. The Total Cup Points were lower for those that didn’t pass quality standards.

What was the strongest relationship you found?

The strongest relationship I found seemed to be between defects and total cup points. Beans of a lower quality standard yield less delicious coffee. Makes sense!


Multivariate Plots Section

Now it’s time to look at the data a little more closely. We’ll be adding some additional variables to our plots as well as looking at individual SCAA tests.

Harvest Year, Total Cup Points, Quality Standards

The lines provide a good quick glance at where the averages lie for Total Cup Points for coffee that met quality standards and coffee that didn’t. It’s interesting how the mean for total cup points plummeted for 2011. My thought is that some very low outliers or incorrect data pulled it down. However, that line staying below the “met quality standards” line is what I would expect.

Mexico was down low on the list of “excellent coffee” measured by total cup points despite being one of the highest producers. Is that because they don’t meet quality standards during the bean inspection process? Surprisingly no! Most of their beans appear to meet inspection.

That brings a suspicion to mind that most bad tasting coffee may be more related to the variety or processing of the beans than the picking of the correct beans. This supports the earlier plots I evaluated.

Country, Total Cup Points, Quality Standards

Total Cup Points, Quality Standards and Average Cup Point

This plot shows that beans with higher quality standards do affect good coffee, but not by much. However, there are a lot of outliers in the true category.

Top 5 Countries, Category One Defects, and Total Cup Result

This supports the idea that, to some degree, the fewer the defects, the better the coffee. But if you look at Category Two, you can see with the U.S. and Kenya that’s not always the case.

Top 5 Countries, Processing Method, and Harvest Year

Early on, the processing method wasn’t reported (see NA below). Most of the top countries tend to favor either natural/dry or washed/wet, with Japan mostly picking pulped natural methods. Later, Kenya changed to the washed/wet method. The natural/dry method is most effective in drier climates, and natural/dry coffees tend to have stronger fruit notes. The washed/wet method will bring out the non-fruit notes because the fruit is removed. The change in processing could be because of weather, but may be because they wanted variety with the coffee beans.

Matrix of the Relationship of the Testing Data

There are a variety of tests and measures that are used to determine total cup points. The matrix plot shows the correlation of all the testing measures, and the first few rows of those measures are printed below.

##   Aroma Flavor Aftertaste Acidity Body Balance Uniformity Clean.Cup Sweetness
## 1  8.67   8.83       8.67    8.75 8.50    8.42         10        10        10
## 2  8.75   8.67       8.50    8.58 8.42    8.42         10        10        10
## 3  8.42   8.50       8.42    8.42 8.33    8.42         10        10        10
## 4  8.17   8.58       8.42    8.42 8.50    8.25         10        10        10
## 5  8.25   8.50       8.25    8.50 8.42    8.33         10        10        10
## 6  8.58   8.42       8.42    8.50 8.25    8.33         10        10        10
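
The correlation matrix itself can be computed with `cor()`. The sketch below uses only the six rows printed above for the six taste scores, so the coefficients it produces are not meaningful and are only there to show the mechanics. Uniformity, Clean.Cup and Sweetness are left out because they are all 10 in these rows, and a column with zero variance gives NA correlations.

```r
# The six rows printed above (taste scores only).
scores <- data.frame(
  Aroma      = c(8.67, 8.75, 8.42, 8.17, 8.25, 8.58),
  Flavor     = c(8.83, 8.67, 8.50, 8.58, 8.50, 8.42),
  Aftertaste = c(8.67, 8.50, 8.42, 8.42, 8.25, 8.42),
  Acidity    = c(8.75, 8.58, 8.42, 8.42, 8.50, 8.50),
  Body       = c(8.50, 8.42, 8.33, 8.50, 8.42, 8.25),
  Balance    = c(8.42, 8.42, 8.42, 8.25, 8.33, 8.33)
)

# Pairwise Pearson correlations between all score columns.
cor_matrix <- cor(scores)

print(round(cor_matrix, 2))
```

On the full dataset the same call would take all nine score columns, for example `cor(coffeedata[, score_columns])`, where `score_columns` is a hypothetical vector of those column names.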

These are all the items that go into the testing to determine the total cup points. Uniformity is a manner of testing between coffee tastes, and clean cup is how it looks, so it makes sense that those items aren’t strongly correlated. The others all have to do with taste, so I can see why they are all strongly correlated. I am surprised by sweetness though.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I originally thought that defects didn’t matter much for overall total cup points. However, the more I investigated, the more it strengthened the idea that defects did negatively affect total cup points.

Were there any interesting or surprising interactions between features?

I found the strong correlation among the tests that go into total cup points interesting. The “viewing” tests were not correlated with the other tests, which is what I would expect.


Final Plots and Summary

Plot One

Description One

The Average Total Cup Points by Coffee Variety plot gives a quick look at which varieties of beans to look for when finding good coffee. Varieties higher on the scale are higher quality coffee. I found the results didn’t differ much using the median.

Plot Two

Description Two

The Total Cup Points by Country plot shows not only which country has higher total cup points, but also that defects in the coffee do affect the taste and which countries tend to have fewer defects. When you buy coffee you won’t necessarily know which beans had defects, but it is useful to know which countries tend to have fewer defects, because this will affect the taste of coffee. For example, note that Ethiopian coffee had few defects.

Plot Three

Description Three

The Processing Method of Top Countries plot shows all the different processing methods. The processing methods are all a matter of taste. For example, if you prefer a fruity coffee then you’d lean toward a coffee that uses the Natural/Dry method, where the fruit dries on the bean. The plot also shows the Harvest Year in which those processes occurred.


Reflection

This dataset had 1311 observations.

Initially I noticed there wasn’t much of a correlation between defects and good tasting coffee, or between moisture and good tasting coffee. But on more careful observation, I realized that when I compared the total cup points with country, and then added the third variable I created, quality standards, you could see that quality standards do affect the taste of coffee.

Also, I noticed that with moisture, there was very little correlation from a numbers aspect. However, if you look at it on the plot, good coffee occurs within a certain range of moisture.

This makes me realize that processing makes more of an impact on the flavor of coffee than I initially thought. However, how is one to really know what processing is used? That’s where origins and patterns come into play. The top five countries by average total cup points are: 1. United States, 2. Papua New Guinea, 3. Ethiopia, 4. Japan and 5. Kenya. It’s easy to say that the United States has the best coffee, but I’m not sure I agree.

United States coffee has a lot of defects, but Ethiopia had the max total cup points. Also, the top variety of coffee was an Ethiopian variety. It leads me to believe that Ethiopian coffee is the true hero. So what about Papua New Guinea? Wasn’t that second? Yes. But I also noticed a lot of their data reporting was blank, and the ranking rests on a single sample, so it’s kind of an unknown. Still, you can’t ignore the fact that taste tests show it is on average the second best tasting coffee. However, without consistent data you can’t make predictions for the future.

Also, it is a matter of taste and opinion. For example, coffees processed with the Natural/Dry method will have more of a fruity taste.

Future work on this dataset could fill in a lot of the empty data. A lot of items such as Farm, Mill, and color of bean were missing, and they may have made a big impact on how to find good coffee.